OC IA P6 - Yelp - computer vision part

This notebook works with the ORB descriptors of a set of pictures, to try to find insights about their similarities and differences.

Get data

Among the 200,000 photos we got from the Yelp dataset, we manually selected 200 for each of the following categories:

Each picture is labelled with its corresponding category:

Let's have a look at random pictures:

Find descriptors for each picture

We'll use the ORB algorithm, which is similar to SIFT and SURF but free of patent restrictions. Its principle is the same: find keypoints in the image, and describe them in a way that is invariant to blurring, resizing, rotation, etc.

Let's test on a random picture:

Group descriptors as "visual words"

If our descriptors were words, there would be only one way to write each of them (after lemmatization). For instance "dive" is always written that way, and "dove" is a completely different word, even though they differ by only a single letter.

But our descriptors are not that precise: if we take two different pictures of the Eiffel Tower and look at the descriptors corresponding to its peak in each photo, they will be similar, but not necessarily identical. It is as if one picture were described by the descriptor "chaair" and the other by "chaiir". So we have to find the "archetypal descriptor" (in our analogy: "chair"), i.e. a descriptor that represents its closest neighbors.

This leads us to group our whole set of descriptors into clusters with KMeans.

Note: since we have labelled data, we could also use these descriptors for image recognition; see this article.

Create descriptors for all sample pictures

Now we're going to create descriptors for each picture in the dataset.

We'll also add keypoints for each picture, for future visualizations.

Once we have this list, the next step will be to find the "archetypal descriptors" by performing a clustering: each centroid will be representative of the descriptors that belong to its cluster.

Descriptors clustering

We'll use KMeans for clustering. As a rule of thumb, we'll set the number of clusters k equal to the square root of the total number of descriptors. In other words, if we have N descriptors, we'll end up with a "vocabulary" of $ \sqrt N $ visual words.
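A sketch of this rule with scikit-learn's `KMeans` on stand-in data (note: real ORB descriptors are binary, and running vanilla Euclidean KMeans on their integer values, as here, is a common approximation):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the stacked ORB descriptors: N rows of 32 values each.
descriptors = rng.integers(0, 256, (400, 32)).astype(np.float64)

# Rule of thumb: k = sqrt(N) visual words.
k = int(np.sqrt(len(descriptors)))  # 400 descriptors -> 20 clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

print(k)                              # 20
print(kmeans.cluster_centers_.shape)  # (20, 32): one "archetypal descriptor" per cluster
```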

Feature creation ("bag of visual words")

As said above, if we describe an image by its descriptors, similar images may end up being described by slightly different descriptors. As a consequence, they may not be identified as similar (as if we were using the descriptors "cchair" and "chaair" without noticing that they are close to each other).

So we will instead describe the image using the "archetypal descriptors" ("chair" in our analogy), i.e. the centroids of the clusters predicted for each descriptor by the KMeans model we just trained.

Doing so, we are likely to reduce the number of distinct descriptors per image, since several descriptors may belong to the same cluster (i.e. may be represented by the same "archetypal descriptor" / centroid).

Once an image is processed, we can sum it up by computing the frequency of its archetypal descriptors. This gives us a vector whose size is the number of archetypal descriptors in the vocabulary.
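The steps above can be sketched as follows, with random stand-in data replacing the real descriptors and the trained model:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
k = 20
# Stand-in vocabulary: KMeans fitted on fake stacked descriptors.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
    rng.integers(0, 256, (400, 32)).astype(np.float64)
)

# One image's descriptors -> histogram of its "archetypal descriptors".
image_descriptors = rng.integers(0, 256, (150, 32)).astype(np.float64)
labels = kmeans.predict(image_descriptors)             # cluster id per descriptor
bovw = np.bincount(labels, minlength=k) / len(labels)  # normalized frequencies

print(bovw.shape)  # (20,): a fixed-size vector, whatever the number of keypoints
```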

If we keep up with our word analogy (that is to say, if we imagine that our descriptors are strings of letters instead of arrays of numbers), the whole process can be pictured as follows:

Conclusion: at first we needed around 500 descriptors to describe our picture, but by clustering the descriptors and computing their frequencies, we end up with a representation of much smaller dimension.

This process can be seen as the equivalent of creating a bag of words (BOW), based on a vocabulary made of these "archetypal descriptors". It only implies an extra preparatory step: creating the vocabulary by finding the centroids of the descriptor clusters.

Visualization of bag of visual words

We'll use three different visualization methods: Principal Component Analysis (PCA), t-SNE and UMAP, to get a 2D projection of these bags of visual words.

Data extraction and normalization

Let's normalize the BOVW we just created:
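As an illustration, here is one common choice, scikit-learn's `StandardScaler` (the notebook's actual normalizer may differ); the `bovw` matrix is a random stand-in:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in BOVW matrix: one row per image, one column per visual word.
bovw = rng.random((100, 20))

# Center each visual-word frequency and scale it to unit variance.
scaler = StandardScaler()
bovw_scaled = scaler.fit_transform(bovw)

print(bovw_scaled.mean(axis=0).round(6))  # ~0 for every column
print(bovw_scaled.std(axis=0).round(6))   # ~1 for every column
```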

PCA

Compute principal components

What proportion of variance do the first 10 components explain?

Projection onto the first 2 components
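Both computations can be sketched with scikit-learn's `PCA`, on a random stand-in for the scaled BOVW matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
bovw_scaled = rng.standard_normal((100, 20))  # stand-in for the scaled BOVW matrix

pca = PCA(n_components=10, random_state=0).fit(bovw_scaled)

# Proportion of variance explained by the first 10 components.
print(pca.explained_variance_ratio_.sum())

# 2D projection for plotting: keep the first two components.
projection = pca.transform(bovw_scaled)[:, :2]
print(projection.shape)  # (100, 2)
```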

t-SNE
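A sketch of the t-SNE projection with scikit-learn, on a random stand-in matrix (`perplexity` must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
bovw_scaled = rng.standard_normal((100, 20))  # stand-in for the scaled BOVW matrix

# Non-linear 2D embedding that tries to preserve local neighborhoods.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(bovw_scaled)
print(embedding.shape)  # (100, 2)
```

UMAP follows the same `fit_transform` pattern through the third-party `umap-learn` package (`umap.UMAP(n_components=2)`).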

UMAP

Comments

In all projections, we observe that images of menus (in red) are rather distinct from the other categories, while the other categories remain quite mixed. One hypothesis is that the features of menu images are more specific (many horizontal lines of letters) than those of other images.

Keypoints visualization for "menu" category

We saw that keypoints seem to be particularly well identified for the "menu" category. What do they look like on the pictures?

Let's select the most significant visual words for this category, i.e. the visual words that are:

Then, on a sample of menu pictures, we'll display the corresponding keypoints.

Get the top 10% most frequent visual words across pictures of all categories

We can see them as "stop labels", the equivalent of stop words in NLP.

Get the 50 most frequent visual words in "menu" pictures

We select those that are not in stop_labels.
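A sketch of this filtering on random stand-in counts; column indices play the role of visual-word ids:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 20
# Stand-in BOVW counts: rows = images, columns = visual words.
bovw_all = rng.integers(0, 10, (100, k))  # pictures from all categories
bovw_menu = rng.integers(0, 10, (30, k))  # "menu" pictures only

# "Stop labels": the top 10% most frequent visual words over all categories.
global_freq = bovw_all.sum(axis=0)
n_stop = max(1, k // 10)
stop_labels = set(np.argsort(global_freq)[::-1][:n_stop])

# Most frequent words in menu pictures, minus the stop labels.
menu_freq = bovw_menu.sum(axis=0)
menu_top = [w for w in np.argsort(menu_freq)[::-1] if w not in stop_labels][:5]
print(menu_top)
```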

Display these significant visual words on random menu pictures

Are they really specific to menu pictures?

If these keypoints are representative of the menu category, we should not find them (or only in much smaller numbers) in pictures from other categories. Let's try on a sample of 10 pictures from categories other than "menu":

Now we can answer our previous question: are there more "menu" keypoints in "menu" pictures than in other pictures? If so, this would confirm that these keypoints are discriminative, and explain why the 2D projections show menu pictures in a rather concentrated area.